Reviews: Adaptively Aligned Image Captioning via Adaptive Attention Time
Although the two techniques have been well explored individually, this is the first work combining them for attention in image captioning. This should make reproducing the results easier. However, the base attention model already performs much better than up-down attention and recent methods such as GCN-LSTM, so it is not clear where the gains are coming from. It would be good to see AAT applied to traditional single-head attention instead of multi-head attention to show convincingly that AAT helps. An analysis of the learned behavior would also strengthen the paper: for instance, how does the number of attention time steps vary with word position in the caption?
After feedback and reviewer discussion, this paper received final ratings of 6, 7 and 7. Although the novelty of the proposed model is relatively minor in the context of previous work proposing Adaptive Computation Time (Graves 2016), the reviewers were impressed by the empirical performance and praised the detailed ablation studies (including the additional experiments with single-headed attention in the author feedback, which was important in reaching the final consensus view of reviewers to accept this paper). We encourage the authors to follow the suggestion of R1 (cut down space devoted to standard captioning components in Secs 3.2.1,
Adaptively Aligned Image Captioning via Adaptive Attention Time
Recent neural models for image captioning usually employ an encoder-decoder framework with an attention mechanism. However, the attention mechanism in such a framework aligns one single (attended) image feature vector to one caption word, assuming a one-to-one mapping between source image regions and target caption words, which rarely holds. In this paper, we propose a novel attention model, namely Adaptive Attention Time (AAT), to align the source and the target adaptively for image captioning. AAT allows the framework to learn how many attention steps to take to output a caption word at each decoding step. With AAT, an image region can be mapped to an arbitrary number of caption words, while a caption word can also attend to an arbitrary number of image regions.
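The core idea (take a variable number of attention steps per decoding step, gated by a learned halting signal, in the spirit of Adaptive Computation Time) can be sketched as follows. This is a minimal illustration, not the authors' exact architecture: the module names, the GRU-based state refinement, and the halting threshold are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class AdaptiveAttentionTimeSketch(nn.Module):
    """Hedged sketch of AAT-style decoding: at each decoding step, keep
    taking attention steps over image regions until a learned halting
    signal indicates the attended context suffices for the next word."""

    def __init__(self, hidden_size, num_heads=4, max_steps=4, threshold=0.99):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.cell = nn.GRUCell(hidden_size, hidden_size)  # refines the decoder state
        self.halt = nn.Linear(hidden_size, 1)             # per-step halting confidence
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, h, image_feats):
        # h: (batch, hidden) decoder state; image_feats: (batch, regions, hidden)
        total_halt = torch.zeros(h.size(0), device=h.device)
        for _ in range(self.max_steps):
            # One attention step: query the image regions with the current state.
            ctx, _ = self.attn(h.unsqueeze(1), image_feats, image_feats)
            h = self.cell(ctx.squeeze(1), h)
            # Accumulate halting confidence; stop once every example is confident.
            total_halt = total_halt + torch.sigmoid(self.halt(h)).squeeze(-1)
            if bool((total_halt >= self.threshold).all()):
                break
        return h  # state used to predict the next caption word
```

Because the number of attention steps can be zero to `max_steps` per word, one region can inform several words and one word can aggregate several regions, which is the adaptive alignment the abstract describes.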
Lun Huang, Wenmin Wang, Yaxian Xia, Jie Chen